Creators/Authors contains: "Kalbarczyk, Zbigniew T."

AWARE: Automate Workload Autoscaling with Reinforcement Learning in Production Cloud Systems

Qiu, Haoran ; Mao, Weichao ; Wang, Chen ; Franke, Hubertus ; Youssef, Alaa ; Kalbarczyk, Zbigniew T. ; Iyer, Ravishankar K. ( July 2023 , 2023 USENIX Annual Technical Conference (USENIX ATC 23))

Workload autoscaling is widely used in public and private cloud systems to maintain stable service performance and save resources. However, it remains challenging to set the optimal resource limits and dynamically scale each workload at runtime. Reinforcement learning (RL) has recently been proposed and applied in various systems tasks, including resource management. In this paper, we first characterize the state-of-the-art RL approaches for workload autoscaling in a public cloud and point out that there is still a large gap in taking the RL advances to production systems. We then propose AWARE, an extensible framework for deploying and managing RL-based agents in production systems. AWARE leverages meta-learning and bootstrapping to (a) automatically and quickly adapt to different workloads, and (b) provide safe and robust RL exploration. AWARE provides a common OpenAI Gym-like RL interface to agent developers for easy integration with different systems tasks. We illustrate the use of AWARE in the case of workload autoscaling. Our experiments show that AWARE adapts a learned autoscaling policy to new workloads 5.5x faster than the existing transfer-learning-based approach and provides stable online policy-serving performance with less than 3.6% reward degradation. With bootstrapping, AWARE helps achieve 47.5% and 39.2% higher CPU and memory utilization while reducing SLO violations by a factor of 16.9x during policy training.
Free, publicly-accessible full text available July 1, 2024
SIMPPO: a scalable and incremental online learning framework for serverless resource management

https://doi.org/10.1145/3542929.3563475

Qiu, Haoran ; Mao, Weichao ; Patke, Archit ; Wang, Chen ; Franke, Hubertus ; Kalbarczyk, Zbigniew T. ; Başar, Tamer ; Iyer, Ravishankar K. ( November 2022 , Proceedings of the 13th ACM Symposium on Cloud Computing (SoCC 2022))

Serverless Function-as-a-Service (FaaS) offers improved programmability for customers, yet it is not server-“less” and comes at the cost of more complex infrastructure management (e.g., resource provisioning and scheduling) for cloud providers. To maintain function service-level objectives (SLOs) and improve resource utilization efficiency, recent research has been focused on applying online learning algorithms such as reinforcement learning (RL) to manage resources. Compared to rule-based solutions with heuristics, RL-based approaches eliminate humans in the loop and avoid the painstaking generation of heuristics. Despite the initial success of applying RL, we first show in this paper that the state-of-the-art single-agent RL algorithm (S-RL) suffers up to 4.8x higher p99 function latency degradation on multi-tenant serverless FaaS platforms compared to isolated environments and is unable to converge during training. We then design and implement a scalable and incremental multi-agent RL framework based on Proximal Policy Optimization (SIMPPO). Our experiments on widely used serverless benchmarks demonstrate that in multi-tenant environments, SIMPPO enables each RL agent to efficiently converge during training and provides online function latency performance comparable to that of S-RL trained in isolation (which we refer to as the baseline for assessing RL performance) with minor degradation (<9.2%). In addition, SIMPPO reduces the p99 function latency by 4.5x compared to S-RL in multi-tenant cases.
Full Text Available
Reinforcement learning for resource management in multi-tenant serverless platforms

https://doi.org/10.1145/3517207.3526971

Qiu, Haoran ; Mao, Weichao ; Patke, Archit ; Wang, Chen ; Franke, Hubertus ; Kalbarczyk, Zbigniew T. ; Başar, Tamer ; Iyer, Ravishankar K. ( April 2022 , EuroMLSys 2022 - Proceedings of the 2nd European Workshop on Machine Learning and Systems)

Serverless Function-as-a-Service (FaaS) is an emerging cloud computing paradigm that frees application developers from infrastructure management tasks such as resource provisioning and scaling. To reduce the tail latency of functions and improve resource utilization, recent research has been focused on applying online learning algorithms such as reinforcement learning (RL) to manage resources. Compared to existing heuristics-based resource management approaches, RL-based approaches eliminate humans in the loop and avoid the painstaking generation of heuristics. In this paper, we show that the state-of-the-art single-agent RL algorithm (S-RL) suffers up to 4.6x higher function tail latency degradation on multi-tenant serverless FaaS platforms and is unable to converge during training. We then propose and implement a customized multi-agent RL algorithm based on Proximal Policy Optimization, i.e., multi-agent PPO (MA-PPO). We show that in multi-tenant environments, MA-PPO enables each agent to be trained until convergence and provides online performance comparable to S-RL in single-tenant cases with less than 10% degradation. Besides, MA-PPO provides a 4.4x improvement in S-RL performance (in terms of function tail latency) in multi-tenant cases.
Full Text Available
Is Function-as-a-Service a Good Fit for Latency-Critical Services?

https://doi.org/10.1145/3493651.3493666

Qiu, Haoran ; Jha, Saurabh ; Banerjee, Subho S. ; Patke, Archit ; Wang, Chen ; Franke, Hubertus ; Kalbarczyk, Zbigniew T. ; Iyer, Ravishankar K. ( December 2021 , WoSC '21: Proceedings of the Seventh International Workshop on Serverless Computing (WoSC7) 2021)

Function-as-a-Service (FaaS) is becoming an increasingly popular cloud-deployment paradigm for serverless computing that frees application developers from managing the infrastructure. At the same time, it allows cloud providers to assert control in workload consolidation, i.e., co-locating multiple containers on the same server, thereby achieving higher server utilization, often at the cost of higher end-to-end function request latency. Interestingly, a key aspect of serverless latency management has not been well studied: the trade-off between application developers' latency goals and the FaaS providers' utilization goals. This paper presents a multi-faceted, measurement-driven study of latency variation in serverless platforms that elucidates this trade-off space. We obtained production measurements by executing FaaS benchmarks on IBM Cloud and a private cloud to study the impact of workload consolidation, queuing delay, and cold starts on the end-to-end function request latency. We draw several conclusions from the characterization results. For example, increasing a container's allocated memory limit from 128 MB to 256 MB reduces the tail latency by 2× but has 1.75× higher power consumption and 59% lower CPU utilization.
Full Text Available
FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices

Qiu, Haoran ; Banerjee, Subho S. ; Jha, Saurabh ; Kalbarczyk, Zbigniew T. ; Iyer, Ravishankar K. ( November 2020 , Proceedings of The 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20))
User-facing latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing of compute resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service level objectives (SLOs) of user requests. This paper presents FIRM, an intelligent fine-grained resource management framework for predictable sharing of resources across microservices to drive up overall utilization. FIRM leverages online telemetry data and machine-learning methods to adaptively (a) detect/localize microservices that cause SLO violations, (b) identify low-level resources in contention, and (c) take actions to mitigate SLO violations via dynamic reprovisioning. Experiments across four microservice benchmarks demonstrate that FIRM reduces SLO violations by up to 16× while reducing the overall requested CPU limit by up to 62%. Moreover, FIRM improves performance predictability by reducing tail latencies by up to 11×.
Full Text Available
Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems

https://doi.org/10.5555/3433701.3433787

Jha, Saurabh ; Cui, Shengkun ; Banerjee, Subho ; Xu, Tianyin ; Enos, Jeremy ; Showerman, Mike ; Kalbarczyk, Zbigniew T. ; Iyer, Ravishankar K. ( November 2020 , Proceedings of the International Conference for High-Performance Computing, Networking, Storage and Analysis (SC 2020))
Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash), and resource overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework, consisting of hierarchical domain-guided machine learning models that identify the failing components, the corresponding failure mode, and point to the most likely cause indicative of the failure in near real-time (within one minute of failure occurrence). Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.
Full Text Available
Availability attacks on computing systems through alteration of environmental control: smart malware approach

https://doi.org/10.1145/3302509.3311041

Chung, Keywhan ; Kalbarczyk, Zbigniew T. ; Iyer, Ravishankar K. ( January 2019 , ACM/IEEE International Conference on Cyber-Physical Systems)

In this paper, we demonstrate the feasibility of smart malware that advances state-of-the-art attacks by (i) indirectly attacking a computing infrastructure through a cyber-physical system (CPS) that manages the environment in which the computing enterprise operates, (ii) disguising its malicious actions as accidental failures, and (iii) self-learning attack strategies from cyber-physical system measurement data. We address all aspects of the malware, including the construction of the self-learning malware and the launch of a failure injection attack. We validate the attacks in a data-driven CPS simulation environment developed as part of this study.
Full Text Available